COVID-19 deaths in the U.S. have surpassed 256,000, and the virus is now the third leading cause of death (COD) in this country, after heart disease and cancer. Before the pandemic, the U.S. already had a high overall mortality rate, and the gap has widened in the last few decades. In this analysis, we put the pandemic’s toll into perspective by comparing where COVID-19 falls as a leading cause of death in the U.S. and how it has affected the number of deaths in the other 12 causes. The purpose of this causes of death analysis is to understand the burden of mortality that is directly or indirectly attributed to COVID-19. Our project seeks to find whether COVID-19 had any effect on the number of deaths from the other 12 causes.
The CDC National Center for Health Statistics (NCHS) collects weekly counts of deaths by state and by select causes, which are categorized by the underlying cause of death listed in the standardized health care grouping of ICD-10 codes. From 2014 to 2019, there were 12 main COD listed, including leading U.S. killers such as diseases of the heart, diabetes, and lower respiratory diseases. With the onset of COVID-19, two more causes were added: COVID-19 Multiple Cause of Death and COVID-19 Underlying Cause of Death (see Figure 1).
Figure 1: COVID-19 Death Certificate
We used three sources of data for this analysis: Weekly Counts of Deaths by State and Select Causes, 2014-2018; Weekly Counts of Deaths by State and Select Causes, 2019-2020; and U.S. 2020 Population Density. We combined the CDC weekly death counts into a dataset that represents provisional counts of deaths by the week the deaths occurred, by state of occurrence, and by select underlying causes of death from 2014-2020. The dataset also includes weekly provisional counts of death for COVID-19, coded to ICD-10 code U07.1 as an underlying or multiple cause of death (see Figure 1).
It is worth noting that our project studies provisional deaths and not final deaths. Provisional deaths are subject to change as more information becomes available and are known to change even a year after the initial death certificate is received. Even with this caveat, provisional deaths are the most accurate way to analyze deaths when comparing against current-year data. In our project we used a six-week cutoff for our data. While not perfect, it did allow us to include data that may come in on a monthly basis, plus the extra time that is often incurred with COVID-19 deaths. Regarding our analysis, please bear in mind that data closer to the current date may be less accurate than data from previous months. More information on how the CDC counts deaths can be found on their website.
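To make the six-week cutoff concrete, here is a minimal sketch of how such a filter could look. The function name, the row layout, and the example dates and counts are all hypothetical; this illustrates the idea and is not the project's actual code.

```python
from datetime import date, timedelta


def drop_recent_weeks(rows, as_of, cutoff_weeks=6):
    """Keep rows whose week-ending date is at least `cutoff_weeks` before `as_of`."""
    cutoff = as_of - timedelta(weeks=cutoff_weeks)
    return [row for row in rows if row["week_ending"] <= cutoff]


# Made-up provisional counts; the most recent week is likely still incomplete.
rows = [
    {"week_ending": date(2020, 9, 26), "deaths": 100},
    {"week_ending": date(2020, 10, 17), "deaths": 90},
    {"week_ending": date(2020, 11, 14), "deaths": 40},
]

kept = drop_recent_weeks(rows, as_of=date(2020, 11, 28))
for row in kept:
    print(row["week_ending"], row["deaths"])
```

Weeks that fall inside the cutoff window are simply excluded, so the remaining counts have had more time to settle.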
The source datasets were pretty clean, so only minimal pre-processing was necessary. We saved the cause of death data in two formats: wide and long, where the long format has one row per cause of death for a given state and week. Here are the cleanup steps we took on the source data:
Causes of Death Dataset Cleanup:
U.S. Population Density State Cause of Death Dataset Cleanup:
The COD dataset has 352 rows and 14 columns, where the first column captures the week date. To give you an idea of what the combined cause of death data looked like, here are the first 10 rows that show COVID-19 deaths:
The population density state causes of death dataset has 239,954 rows and 15 columns. This is in a long format, where each row represents a COD for that given week and state. Here is a sample of the data:
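As an aside on the two storage formats, the wide-to-long reshaping can be sketched in a few lines of plain Python. The column names, state name, and counts below are made up for illustration and are not taken from the project's data; in pandas, `DataFrame.melt` performs the same reshaping in one call.

```python
def wide_to_long(wide_rows, id_columns=("week_ending", "state")):
    """Turn one row per week/state into one row per week/state/cause."""
    long_rows = []
    for row in wide_rows:
        ids = {col: row[col] for col in id_columns}
        for cause, deaths in row.items():
            if cause not in id_columns:
                long_rows.append({**ids, "cause": cause, "deaths": deaths})
    return long_rows


# Hypothetical wide-format rows: one column per cause of death.
wide = [
    {"week_ending": "2020-04-11", "state": "State A", "heart_disease": 120, "diabetes": 15},
    {"week_ending": "2020-04-18", "state": "State A", "heart_disease": 110, "diabetes": 12},
]

long_rows = wide_to_long(wide)
for row in long_rows:
    print(row)
```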
To get a sense of the distribution of the deaths across causes you can look at Figure 2, where we show the total death count for the years prior to COVID-19 (2014-2019) versus how it looks in the year 2020. Heart disease and cancer clearly have a large percentage of the deaths, and you can see the impact when COVID-19 showed up in 2020.
Figure 2: Total deaths by cause
To get an overall feel for COVID-19 deaths across the U.S., Figure 3 gives you an idea of the regional distribution. This figure has death counts normalized by population (per capita) to make a fairer comparison. Hovering over an individual state shows the state’s name, number of COVID-19 deaths, and the number of COVID-19 deaths compared to the population of that state. By calculating the COVID-19 deaths per capita, we can see which states are getting hit harder and look for regional and subregional patterns with further analysis. Looking at Figure 3, we can see a strong concentration of per capita COVID-19 deaths in the South and Northeast regions.
Figure 3: COVID-19 deaths across U.S. per capita
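The per capita normalization behind Figure 3 can be sketched as follows. The state names and numbers are invented, and scaling to deaths per 100,000 residents is an assumption made here for readability; the original figure may use a different scale.

```python
def deaths_per_100k(deaths, population):
    """Deaths per 100,000 residents."""
    return deaths * 100_000 / population


# Hypothetical states: B has more deaths in total, A has more per capita.
states = {
    "State A": {"covid_deaths": 500, "population": 1_000_000},
    "State B": {"covid_deaths": 2_000, "population": 10_000_000},
}

rates = {
    name: deaths_per_100k(info["covid_deaths"], info["population"])
    for name, info in states.items()
}
hardest_hit = max(rates, key=rates.get)
print(rates, hardest_hit)
```

The example shows why the normalization matters: ranking by raw counts and ranking by rate can put different states on top.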
The main focus of our analysis was comparing the COD over time. In other words, we looked for a correlation between one COD over time and another COD, with special focus placed on COVID-19. We used four strategies to address this question: Pearson correlation coefficients, Granger causality, predicted deaths, and basic descriptive visualizations like time series plots. The time series plots started our analysis by looking for interesting patterns. We then took a more statistical approach, which helped provide evidence of a relationship between COD and COVID-19 and informed our analytical focus. Then, taking the evidence found in the previous analysis, we looked to see if the patterns extend to individual states and regions.
We started our analysis with a simple time series plot, showing the years 2019-2020 (see Figure 4). We wanted to see if any unusual patterns jumped out at us. The first thing that is easy to see is the fast rise of COVID-19 deaths at the beginning of 2020. We see a few other obvious changes: most COD had a little spike when COVID-19 first started, suggesting that either the other COD just happened to also increase at the same time, or there was a more interesting correlation between COVID-19 and the other COD. Heart disease demonstrates the largest spike during the early stages of the COVID-19 pandemic. There is also an odd rise in unknown COD, which is still continuing today.
Figure 4: Causes of Death from 2019-2020
Another way of viewing the COD over time is by percentage: what proportion of all deaths is related to each COD (see Figure 5). This stacked view of the data doesn’t show total counts, but does show the disruption COVID-19 caused to the percentages.
Figure 5: Causes of Death Percentage from 2019-2020
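To illustrate how a percentage view can differ from a count view, here is a small sketch with made-up weekly counts. The cause names and numbers are hypothetical and only show the arithmetic behind a stacked percentage chart.

```python
def cause_shares(counts):
    """Percentage of all deaths attributed to each cause."""
    total = sum(counts.values())
    return {cause: 100 * n / total for cause, n in counts.items()}


# Hypothetical weekly counts before and during the pandemic.
week_2019 = {"heart_disease": 600, "cancer": 400}
week_2020 = {"heart_disease": 660, "cancer": 400, "covid_19": 440}

shares_2019 = cause_shares(week_2019)
shares_2020 = cause_shares(week_2020)
print(shares_2019)
print(shares_2020)
```

In this toy example the heart disease count rises from 600 to 660, yet its share of all deaths falls, because the new COVID-19 deaths enlarge the total.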
We then started looking at statistical methods to evaluate correlations between COD over time. If we were not dealing with time series data, the default correlation analysis would be Pearson correlation coefficients. However, time series data often has random walk characteristics, which can lead to spurious correlations (correlations that are not real). You can usually overcome this by differencing the series, that is, taking the difference between each value and its lagged value. That is the technique we used in our analysis of the year 2020, when COVID-19 death counts were introduced. You can see the Pearson correlation results in Figure 6, where the range is -1 to 1, with values close to 0 being weak correlations. The COD that shows a strong correlation is heart disease (0.7), and a couple that show moderate correlation are diabetes (0.6) and Alzheimer’s (0.5).
Figure 6: Pearson correlation coefficient for weeks with COVID-19 deaths
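The effect of differencing before correlating can be sketched with two invented series. This is a minimal illustration that assumes simple first differences (each week minus the previous week); the series values are made up and the helper names are hypothetical.

```python
from math import sqrt


def difference(series):
    """First differences: each value minus the previous one."""
    return [b - a for a, b in zip(series, series[1:])]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Both series trend upward, but their week-to-week changes move in opposite directions.
series_a = [10, 12, 11, 15, 14, 18, 17, 21]
series_b = [5, 4, 7, 6, 9, 8, 11, 10]

raw_r = pearson(series_a, series_b)
diff_r = pearson(difference(series_a), difference(series_b))
print(round(raw_r, 2), round(diff_r, 2))
```

The raw series look positively correlated only because they share an upward trend; once the trend is differenced away, the sign of the correlation flips.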
One of the most powerful ways to identify a correlation between time series is the Granger causality test. It is related to the time series modeling technique called vector autoregression (VAR). VAR is a multivariate time series technique that can predict future values by using two or more autoregressive variables. In simple terms, the lags of two time series can be used to predict one time series. For example, if you know yesterday’s prices of both gold and oil, you can predict tomorrow’s price of gold better than you could from yesterday’s gold price alone. Related to this analysis of COVID-19, does the COVID-19 death count help predict the death count of another COD? This also applies the other way around: does another COD help predict COVID-19? This is where the “causality” part of the Granger causality test name comes from. You can actually test for directional correlation. However, do not be misled by the term “causality”, as this test does not prove causality, only correlation. A little bit of trivia for those interested: Clive Granger, who developed the test, shared the 2003 Nobel Memorial Prize in Economic Sciences with Robert F. Engle for their work on methods of analyzing economic time series.
There are just a few more process steps to share before showing the results. First, only weeks with COVID-19 deaths are used in this data, since that is our primary interest in this analysis. Second, an automated selection of autoregressive terms was used based on the Akaike information criterion (AIC). The number of lags is important, because as mentioned previously, we need to build a VAR model using a combination of all pairs of time series. The result of our Granger analysis can be seen in Figure 7, where we show all Granger causality relationships with COVID-19. Note, the p-value is what determines if the relationship is significant (0.05 is highlighted by a vertical reference guide). Based on the results of the Granger causality test, only diabetes and heart disease show statistical significance (< 0.05), with cancer and septicemia close behind (< 0.07).
Figure 7: Granger causality for weeks with COVID-19 deaths
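For readers who want to see the mechanics of the test, the sketch below computes the Granger F-statistic by comparing two regressions: one that predicts a series from its own lags, and one that also includes lags of the other series. It assumes NumPy, a fixed lag count rather than the AIC-based selection described above, and synthetic data. It stops at the F-statistic; converting it to a p-value requires the F distribution, and statsmodels provides a full test as `grangercausalitytests`.

```python
import numpy as np


def lagged(series, lags):
    """Matrix whose columns are series[t-1], ..., series[t-lags] for t >= lags."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])


def residual_sum_of_squares(design, target):
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)


def granger_f(y, x, lags=1):
    """F-statistic for 'lags of x help predict y beyond lags of y alone'."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[lags:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lagged(y, lags)])       # own lags only
    unrestricted = np.hstack([restricted, lagged(x, lags)])  # plus lags of x
    rss_r = residual_sum_of_squares(restricted, target)
    rss_u = residual_sum_of_squares(unrestricted, target)
    df_resid = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / df_resid)


# Synthetic example: y follows x with a one-step delay, not the other way around.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.zeros(200)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=199)

f_x_to_y = granger_f(y, x, lags=1)
f_y_to_x = granger_f(x, y, lags=1)
print(f_x_to_y, f_y_to_x)
```

A large F-statistic in one direction and a small one in the other is the directional pattern the test is designed to detect.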
Another way to show correlations is by predicting what we think the death count for each COD should have been in 2020, and then comparing the prediction with the actual death count. If we see unusual behavior during the COVID-19 pandemic, then that suggests there is a correlation. For that purpose, we first created a predictive model for the non-COVID-19 COD. We used a time series modeling technique called autoregressive integrated moving average (ARIMA). After we had the models created, we predicted what the death count should have been in 2020. Comparing the predictions and a 95% confidence interval with the actual death counts, we end up with Figure 8. A few COD clearly show some unusual activity: Alzheimer’s, heart disease, diabetes, and unknown causes. This gives us additional evidence that COVID-19 is correlated with heart disease and diabetes, as was identified in our Pearson correlation (see Figure 6) and Granger analysis (see Figure 7). Alzheimer’s had a Granger p-value of 0.25, so it isn’t significant, but the prediction plot is still supportive of a correlation. Unknown causes has a pattern that can’t directly be understood given its wide coverage of conditions. We aren’t completely sure why it shows such unusual behavior. However, whatever the cause, we don’t believe it is related to COVID-19 (at least not directly), since neither Pearson correlations nor Granger causality show any correlation.
Figure 8: Predicted Deaths vs Actual Deaths
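The comparison of predicted and actual deaths can be sketched without fitting a full ARIMA model. The example below deliberately swaps in a much simpler baseline, the average of the same week in prior years with a rough band of 1.96 standard deviations, so it is not the ARIMA approach described above and the numbers are invented. It only shows how weeks outside a prediction band get flagged.

```python
from statistics import mean, stdev


def baseline_band(history, z=1.96):
    """Per-week (predicted, lower, upper) from prior years' counts for the same week.

    `history` is a list of prior years, each a list of weekly counts.
    """
    bands = []
    for week_counts in zip(*history):
        center = mean(week_counts)
        spread = z * stdev(week_counts)
        bands.append((center, center - spread, center + spread))
    return bands


def flag_unusual(actual, bands):
    """Indexes of weeks whose actual count falls outside the band."""
    return [
        week
        for week, (count, (_, low, high)) in enumerate(zip(actual, bands))
        if count < low or count > high
    ]


# Hypothetical counts: three prior years, three weeks each, then the current year.
history = [[100, 102, 98], [104, 100, 101], [102, 98, 99]]
actual = [103, 140, 100]

bands = baseline_band(history)
unusual_weeks = flag_unusual(actual, bands)
excess = [count - predicted for count, (predicted, _, _) in zip(actual, bands)]
print(unusual_weeks, excess)
```

With only a few prior years the band is crude; a fitted time series model such as ARIMA gives a more defensible interval, but the flagging logic stays the same.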
Our map in Figure 3 showed that patterns emerged when evaluating by region. However, when evaluating by individual states or subregions for multiple years, the plots become unwieldy. In Figure 9, we have limited the years to 2018-2020 to give the best visual display for comparison, while eliminating some of the redundancy among the similar years 2014-2019.
Evaluating causes of death by region and year gives us an additional view of how COVID-19 affects the different causes of death. Viewing all years reveals a relatively stable distribution of the percentages among the causes of death between 2014 and 2019. However, as Figure 9 shows, we start to see a different picture when we view 2018-2020. As expected, this is most notable in causes that already have a high percentage of the total deaths, such as heart disease and cancer, but what is additionally interesting is that the change in percentage is not uniform across regions. For example, the change in percentage for heart disease and Alzheimer’s between 2019 and 2020 in the Northeast is much higher than in the other three regions, as is the region’s COVID-19 death count. And while unknown/other deaths increased across all regions, we see a very unusual jump in the West. The number of cancer deaths holds fairly stable across the years when viewed by region.
By hovering over each bar, you are able to see the cause of death, the percent of deaths for that cause, the total number of deaths for that cause, and the total number of deaths for all causes.
Figure 9: Causes of Death by Region and Year
*Does not include Puerto Rico
While it may seem that Figure 9 shows causes of death going down while other plots show them going up, that is because Figure 9 reflects each cause of death as a percentage of all deaths. To verify this, we can look at the example of heart disease deaths in the South across the years. As you can see in Table 1 below (codTotal = cause of death total and YRDeathTotal = total deaths for the year for that region), the overall deaths went up drastically for the same portion of the year (due to COVID-19 deaths), but heart disease deaths increased only slightly by comparison. However, heart disease does show some abnormal upticks, so deeper analysis with controls for population and additional years is needed.
As stated in the introduction, our project sought to find whether COVID-19 had any effect on the number of deaths in the other 12 main causes of death. To determine this, we looked at several different methods: Granger causality, Pearson correlation, time series plots of the number of deaths by cause, stacked plots of the percentage of deaths for each cause over time, predictions by cause of death to determine excess deaths over predicted deaths, and geospatial plots to see how deaths are regionally dispersed, followed by drilling down into how the causes of death changed with the addition of COVID-19.
Some of our plots, like our Causes of Death over Time, were inconclusive about any effect, but did show interesting ebb-and-flow patterns for each cause when the timeline was expanded that are not readily seen when the timeline is shortened. Likewise, our Pearson correlation showed a slight relationship between COVID-19, heart disease, and diabetes when comparing 2019 and 2020, but that relationship declined as more years were added. The application of Pearson correlation was questionable with our dataset, so we do not rely on it for our conclusion.
Once we move to our stacked plot showing Causes of Death Percentage over time, we do see a distinct visual representation of how COVID-19 affected other causes of death (both negatively and positively), with heart disease, cancer, influenza, and unknown* cause standing out. This is consistent with our Granger causality results, where heart disease and diabetes show a statistically significant relationship with COVID-19 and septicemia comes close.
Moving to our Cause of Death by Predictions plots, we see a much deeper picture of which causes of death COVID-19 is affecting with an increased or decreased count of deaths. As with the other plots, heart disease shows a strong relationship, with a higher than predicted overall death count, particularly during the April/May spike of deaths. Diabetes, cerebrovascular, kidney disease, and, interestingly, Alzheimer’s/Dementia causes of death follow a similar pattern. Further research into Alzheimer’s shows that the disease contributes to conditions such as pneumonia, heart failure, and infections in general, so a propensity for multiple levels of diminished health may be a contributor here. A strong spike in unknown* deaths is particularly striking. Influenza, lower respiratory, and septicemia appear to have a negative effect, and it would be interesting to combine these results with a reduced mobility analysis. The categories of other respiratory and cancer seem to be more indeterminate; it is possible that later data will show a pattern, or that comparing the highs and lows for multiple years will yield additional information for these categories.
Our United States map does not show an effect of the different causes on each other, but it does help to establish areas that are data rich, with higher deaths per capita, to look into more closely. We particularly see a current trend in the Northeast, and this coincides with our Causes of Death by Region (with population normalized) plots in Figure 9. The Northeast particularly shows COVID-19 deaths affecting several other causes of death. The stability of cancer coincides with our other plots, and indicates that there may not be an increase of risk for a majority of cancer patients.
Our final results dictate a need for further investigation. (List here the causes that were affected, positively, negatively, and indeterminate.)
LIST HERE: list of further research.
*Unknown here does not mean cause of death is unknown. Some of these deaths are truly unknown, but many of them fall into ICD-10 codes that individually do not make up a statistically significant bucket. As such, CDC and many other sources will often list them together. For further information on what these deaths represent, please visit here.